-
-
Notifications
You must be signed in to change notification settings - Fork 4.5k
feat(deletions): Retry deleting from Nodestore for certain Snuba errors #102391
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| "deletions.nodestore.retry", | ||
| tags={"type": f"snuba-{type(snuba_error).__name__}"}, | ||
| sample_rate=1, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Test Fails Due to Unexpected Error Handling
The test_snuba_errors_retry test expects UnqualifiedQueryError("All project_ids from the filter no longer exist") to trigger a retry. However, the code specifically handles this error by logging an info metric and completing normally, which causes the test to fail and contradicts the behavior in test_deletion_with_all_projects_deleted.
❌ 2 Tests Failed:
View the top 2 failed test(s) by shortest run time
To view more test analytics, go to the Test Analytics Dashboard |
|
|
||
| # TODO: Add specific error handling for retryable errors and raise RetryTask when appropriate | ||
| except Exception: | ||
| metrics.incr(f"{prefix}.error", tags={"type": "unhandled-exception"}, sample_rate=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This metric let me see that something changed in the last couple of days and to this Sentry issue.
| # This is not a transient error - retrying won't help since the project is permanently gone. | ||
| logger.info("All project_ids from the filter no longer exist") | ||
| # There may be no value to track this metric, but it's better to be safe than sorry. | ||
| metrics.incr(f"{prefix}.info", tags={"type": "all-projects-deleted"}, sample_rate=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reducing the metric from warning to info.
| metrics.incr(f"{prefix}.warning", tags={"type": "all-projects-deleted"}, sample_rate=1) | ||
| # When deleting groups, if the project gets deleted concurrently (e.g., by another deletion task), | ||
| # Snuba raises UnqualifiedQueryError with the message "All project_ids from the filter no longer exist". | ||
| # This happens because the task tries to fetch event IDs from Snuba for a project that no longer exists. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to remember that we delete from the Nodestore as a best effort since eventually the events will ttl.
Alternatively, we could make the nodestore tasks delete the project rather than in the spawning task.
| metrics.incr(f"{prefix}.error", tags={"type": "unqualified-query-error"}, sample_rate=1) | ||
| # Report to Sentry to investigate | ||
| raise DeleteAborted(f"{error.args[0]}. We won't retry this task.") from error | ||
| metrics.incr(f"{prefix}.error", tags={"type": type(error).__name__}, sample_rate=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using type(error).__name__ instead of unqualified-query-error to distinguish the errors in DD.
| logger.warning( | ||
| f"{prefix}.retry", extra={**extra, "error_type": error_type, "error": str(error)} | ||
| ) | ||
| raise RetryTask(f"Snuba error: {error_type}. We will retry this task.") from error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the new feature in this PR. I believe the errors above can be retried (we have an upper bound of how many times we would try).
| ) as error: | ||
| error_type = type(error).__name__ | ||
| metrics.incr(f"{prefix}.retry", tags={"type": f"snuba-{error_type}"}, sample_rate=1) | ||
| logger.warning( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't trigger a Sentry event every time this happens since the retried task may succeed.
| raise RetryTask(f"Snuba error: {error_type}. We will retry this task.") from error | ||
|
|
||
| except Exception as error: | ||
| metrics.incr(f"{prefix}.error", tags={"type": type(error).__name__}, sample_rate=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using type(error).__name__ instead of unhandled-exception to distinguish in Datadog.
Snuba can sometimes raise errors when we're trying to remove from the Nodestore.
The changes here will retry certain of those errors.